Abstract of JP2001312293

PROBLEM TO BE SOLVED: To recognize speech with a small
quantity of computation without lowering recognition
performance in a speech recognition technology.

SOLUTION: This technology has: a step in which a phonetic
notation series, generated by merging phonemes having
similar features in the phonetic representation of the
vocabulary set to be recognized, is converted into a series
of speech segments, the minimum units of recognition, and
expanded into a phoneme merged speech segment tree; a
collating step in which previously found standard patterns
representing features of speech are connected according to
the phoneme merged speech segment tree and collated with
the feature vector time series of an unknown input speech
signal by DP matching using a beam search; and a step in
which, if the result is not uniquely determined, the
standard patterns are connected according to a speech
segment tree for re-collation and collated with the unknown
input speech, and the recognition result is output.
Consequently, speech can be recognized with a small
quantity of computation without lowering recognition
performance.
[0015]
[Formula 1]

\[
\begin{aligned}
& L(0,0;0) = 0 \\
& \text{for } j = 1, \dots, J: \\
& \quad \text{for } k = 1, \dots, K: \\
& \qquad L(0,k;j-1) = L(I_k, k; \dots) \\
& \qquad \text{for } i = 1, \dots, I_k: \\
& \qquad\quad L(i,k;j) = \max \{\, L(i,k;j-1) + d(i,m;j),\; L(i-1,k;j-1) + d(i,m;j) \,\}
\end{aligned}
\]



[0019]
[Formula 2]

\[
\theta = \max_{i,k} \{ L(i,k;j-1) \} - \alpha
\]

[0050] (Embodiment 1)

Fig. 1 is a block diagram of a speech recognition
apparatus according to Embodiment 1 of the present
invention, which is described below.

[0051] In Fig. 1, reference numeral 1 denotes a microphone
for collecting speech, 2 an A/D converter, 3 an interface
(I/F), 4 a memory, 5 a CPU, 6 a keyboard/display, 7 a CPU
bus, 8 an I/F, 9 an output, 10 a recognition target
vocabulary set, 11 a phoneme merged speech segment tree,
12 a rough speech segment tree, 13 a first half speech
segment re-collation tree, 14 a speech segment re-collation
tree, 15 a speech segment standard pattern, 16 a phoneme
merged speech segment standard pattern, 17 a rough speech
segment standard pattern, and 18 an accurate speech segment
standard pattern.

[0052] First, the phoneme merged speech segment tree 11,
which corresponds to the recognition dictionary in
Embodiment 1, is described with reference to Figs. 3, 4
and 5.
[0053] Conceivable units for a standard pattern include a
phoneme segment, a phoneme, a syllable, CV/VC (consonant +
vowel / vowel + consonant), VCV, CVC and the like. These
minimum recognition units are referred to as speech
segments. In the present embodiment, the base units are CV,
which spans from the start of a consonant to the center of
a vowel; VC, which spans from the center of a vowel to the
end of the vowel; and VV, which spans from the center of a
vowel to the center of the following vowel. Although VC
covers only a vowel section, it is defined differently
depending on the subsequent consonant.
[0054] For example, assume that the recognition target
vocabulary consists of eight words: "kirihara", "kiryu",
"chiri", "chiryu", "meguro", "memuro", "nemuro" and
"fuchu". Representing them as speech segment sequences
gives the results shown in Fig. 4.

[0055] Fig. 3 shows the above results as a simple tree
structure. In the present embodiment, this is defined as
the base speech segment tree. It is the same as the speech
segment tree used in the prior art. Although speech
segments are allocated to arcs here, they may instead be
allocated to nodes. Each node corresponding to the end of a
word is marked so that it can be recognized as the end of
that word. Such a node is defined as a leaf node. In Fig.
3, the leaf nodes are represented by solid circles. The
depth of the tree is counted from its root (a first step, a
second step, and so on).
[0056] In the phoneme merged speech segment tree 11,
phonemes in the first to nth steps of the base phoneme tree
are merged, which suppresses the expansion of the tree near
the initial phonemes of words. From the (n+1)th step
onward, the phoneme merged speech segment tree is the same
as the base speech segment tree.

[0057] Phoneme merging in the first to nth steps is
performed as follows. Japanese has only five vowels, so
they are relatively easy to distinguish. Consonants, on the
other hand, are very difficult to distinguish because the
number of categories is large. Consonants are therefore
merged by phoneme group (e.g., unvoiced stop consonants,
fricative consonants, voiced stop consonants and the like),
and consonants within the same phoneme group are not
distinguished. That is, individual consonant phonemes are
not distinguished; consonants are distinguished only by
phoneme group, such as the unvoiced stop consonant group or
the fricative consonant group. This means that "kiryu" and
"chiryu", which differ only in their initial phoneme, are
collated without any distinction.
[0058] Since merging is performed between consonants that
have similar acoustic features within the same phoneme
group, merging introduces little error. Moreover, different
phoneme groups are easy to distinguish because their
acoustic features are very different. A correct candidate
is therefore rarely pruned, and recognition performance
deteriorates little. In the present embodiment, the
consonants are divided into four categories as shown in
Fig. 5.
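
The grouping described above could be sketched in Python as
follows. This is only an illustration: the four categories
actually used are those of Fig. 5, which is not reproduced
here, so the group memberships and the names PHONEME_GROUPS
and merge_phoneme are assumptions made for this sketch.

```python
# Hypothetical consonant groups; the real four categories are given in Fig. 5.
# "c" stands for the initial consonant of "chi", following the /ci/ notation.
PHONEME_GROUPS = {
    "unvoiced_stop": ("p", "t", "k", "c"),
    "fricative": ("s", "sh", "h", "f"),
    "voiced_stop": ("b", "d", "g"),
    "nasal": ("m", "n"),
}
VOWELS = ("a", "i", "u", "e", "o")

# Reverse lookup: consonant -> name of its phoneme group.
GROUP_OF = {c: g for g, members in PHONEME_GROUPS.items() for c in members}


def merge_phoneme(phoneme):
    """Keep vowels as they are; replace a consonant by its group label."""
    if phoneme in VOWELS:
        return phoneme
    # Consonants missing from the illustrative table are left unmerged.
    return GROUP_OF.get(phoneme, phoneme)
```

Under this table the initial consonants of "kiryu" and
"chiryu" map to the same label, so the two words would not
be distinguished by their first phoneme.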

[0059] Merging phonemes also merges speech segments. CV
segments are merged when the following vowel is the same,
and VC segments are merged when the preceding phoneme is
the same. A speech segment obtained by merging phonemes for
each phoneme group is defined as a phoneme merged speech
segment. Examples of how phoneme merged speech segments are
formed, and their notation, are shown in Fig. 6.

[0060] In the base speech segment tree, the speech segments
in the first to nth steps are converted to phoneme merged
speech segments, and arcs to which the same phoneme merged
speech segment is allocated are merged. This yields a tree
with reduced expansion in the vicinity of the initial
phonemes of words. This is the phoneme merged speech
segment tree.

[0061] Fig. 7 shows the phoneme merged speech segment tree
obtained by converting the speech segments in the first to
third steps (n=3) of the base speech segment tree of Fig. 3
to phoneme merged speech segments and merging them. In the
phoneme merged speech segment tree of Fig. 7, expansion
near the initial phonemes of words is suppressed compared
with the speech segment tree of Fig. 3. If n is 1, only the
first speech segments of the words are merged; if n is ∞,
all the speech segments are merged. It is efficient to
choose n so that the quantity of calculation stays within
what real-time processing allows. In the phoneme merged
speech segment tree, a plurality of words may be allocated
to one leaf node, as in the case of "kiryu" and "chiryu".
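
A minimal Python sketch of this construction is given below.
It is not the patent's implementation. The segmentations in
LEXICON are illustrative stand-ins for the sequences of Fig.
4, and GROUP_OF, merge_segment, build_tree and count_arcs
are hypothetical names introduced here.

```python
VOWELS = {"a", "i", "u", "e", "o"}
# Hypothetical group labels; "c" stands for the initial consonant of "chi".
GROUP_OF = {"p": "P", "t": "P", "k": "P", "c": "P"}


def merge_segment(segment):
    """Replace each consonant in a speech segment by its group label."""
    return tuple(p if p in VOWELS else GROUP_OF.get(p, p) for p in segment)


def build_tree(lexicon, n):
    """Build a tree whose first n steps use phoneme merged segments."""
    root = {"children": {}, "words": []}
    for word, segments in lexicon.items():
        node = root
        for depth, segment in enumerate(segments):
            # Steps 1..n are merged; later steps keep the original segment.
            key = merge_segment(segment) if depth < n else segment
            node = node["children"].setdefault(
                key, {"children": {}, "words": []})
        node["words"].append(word)  # a leaf node may hold several words
    return root


def count_arcs(node):
    """Number of arcs below a node, a rough measure of tree expansion."""
    return sum(1 + count_arcs(child) for child in node["children"].values())


# Illustrative segmentations only; the actual sequences are those of Fig. 4.
LEXICON = {
    "kiryu": [("k", "i"), ("i", "r"), ("ry", "u")],
    "chiryu": [("c", "i"), ("i", "r"), ("ry", "u")],
    "chiri": [("c", "i"), ("i", "r"), ("r", "i")],
}
```

Calling build_tree(LEXICON, 0) gives the base tree, while
build_tree(LEXICON, 3) merges the first three steps, so that
"kiryu" and "chiryu" share one leaf node and the tree has
fewer arcs.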

[0062] Next, the operation of the speech recognition
apparatus in Embodiment 1 of the present invention is
described with reference to the flowchart of Fig. 2.

[0063] In Fig. 2, the speech segment standard pattern 15 is
found for each speech segment by training in advance on
training data uttered by many speakers. The phoneme merged
speech segment standard pattern 16 is found by training on
all the training data of the speech segments that are
merged. For example, the standard pattern for the speech
segment /{p, t, k, c}i/ is obtained by training on all the
training data of /pi/, /ti/, /ki/ and /ci/. This is done in
advance for every phoneme merged speech segment.
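
The pooling of training data described above might look like
the following Python sketch. The data layout (a dictionary
from a (consonant, vowel) pair to a list of training
samples) and the names merged_label and pool_training_data
are assumptions for illustration; only the unvoiced stop
group from the example is modeled.

```python
from collections import defaultdict

# The merged group used in the /{p, t, k, c}i/ example.
UNVOICED_STOPS = ("p", "t", "k", "c")


def merged_label(segment):
    """Map a CV segment to its phoneme merged label."""
    consonant, vowel = segment
    if consonant in UNVOICED_STOPS:
        return ("{p,t,k,c}", vowel)
    return segment  # other groups are omitted from this sketch


def pool_training_data(data):
    """Gather the samples of all segments that share a merged label."""
    pooled = defaultdict(list)
    for segment, samples in data.items():
        pooled[merged_label(segment)].extend(samples)
    return dict(pooled)
```

A standard pattern for the merged segment would then be
trained on the pooled list instead of on the samples of any
single segment.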

[0064] In the present embodiment, the appearance
probability of a feature parameter vector is assumed to be
approximated by a sum of a plurality of Gaussian
distributions (referred to as a mixture distribution), and
the mean vectors and covariance matrices of the Gaussian
distributions are found from the training data for each
frame of the standard pattern.
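
Written out, the mixture assumption takes the usual form
below. The symbols are assumptions for this sketch, since
the text does not give the formula: x is a D-dimensional
feature parameter vector, M is the number of Gaussian
distributions in the mixture, and w_m, \mu_m and \Sigma_m
are the weight, mean vector and covariance matrix of the
m-th distribution.

```latex
p(x) = \sum_{m=1}^{M} w_m \, \mathcal{N}(x; \mu_m, \Sigma_m),
\qquad \sum_{m=1}^{M} w_m = 1,

\mathcal{N}(x; \mu, \Sigma) =
\frac{1}{(2\pi)^{D/2} \, |\Sigma|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right)
```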
[0065] The phoneme merged speech segment tree 11 is made in
advance from the recognition target vocabulary set 10 in
the tree development processing S01.

[0066] First, in acoustic analysis processing S01, an input
unknown speech signal is converted to D feature parameters
for each analysis time (hereinafter referred to as a
frame). Examples of the feature parameters include the LPC
cepstrum coefficients and LPC mel-cepstrum coefficients
obtained by linear prediction analysis; the mel-LPC
cepstrum coefficients obtained by mel-linear prediction
analysis; and the mel-frequency cepstrum coefficients
(MFCC) obtained with a mel-scale filter bank. Any feature
parameter may be used as long as it is suited for speech
recognition.

[0067] In collation processing S02, the feature parameter
time series of the unknown input speech from the acoustic
analysis processing S01 is collated with that of the
standard patterns, which are formed by connecting the
phoneme merged speech segment standard pattern 16 and the
speech segment standard pattern 15 in accordance with the
phoneme merged speech segment tree 11. The collation is
performed by DP matching using input frame synchronous beam
search. Since the methods for DP matching and beam search
are the same as those of the conventional example, their
description is omitted. This collation is referred to as
the first collation.

[0068] Although the phoneme merged speech segment tree 11
is made in advance in the present embodiment, it may also
be developed dynamically while the beam search DP is
performed.

[0069] A conceptual diagram of DP using input frame
synchronous beam search is shown in Fig. 9. In Fig. 9, the
abscissa represents the frames of the input speech, while
the ordinate represents the frames of the speech segment
standard patterns connected in accordance with the phoneme
merged speech segment tree. The ordinate, which serves as
the dictionary, is tree-shaped. DP matching between the
input speech and the tree-shaped dictionary calculates
scores while finding an optimal path between the input
speech pattern and the standard patterns on the tree-shaped
DP screen. On the tree-shaped DP screen, phonemes in the
first to nth steps have been merged, so the expansion of
branches is suppressed.
[0070] In DP matching, DP paths are pruned by beam search
synchronously with the input frames. Since the number of
candidate lattice points remaining in the beam is far
smaller than the number of all the lattice points on the DP
screen, the DP screen need not be held in actual memory; it
is a virtual one.

[0071] Some time after the start of vocalization, the
cumulative scores of DP paths in the dictionary that are
not similar to the speech content become small enough to be
pruned, and the number of candidate lattice points drops
drastically. Reducing the number of candidate lattice
points before that point therefore reduces the total
quantity of calculation. As in Embodiment 1, suppressing
the expansion of the tree in the vicinity of the initial
phonemes of words greatly reduces the number of candidate
lattice points remaining in the beam near the start of
speech.
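
The following Python sketch shows one way frame synchronous
DP with beam pruning over a tree could work. It is a
simplification under stated assumptions, not the
conventional method the text refers to: each tree node
stands for a single state, the caller supplies the local
score function, node 0 is a virtual root that matches no
frame, and the names beam_dp, children, local_score and
beam_width are hypothetical.

```python
import math


def beam_dp(frames, children, local_score, root=0, beam_width=5.0):
    """Return {node: cumulative score} for nodes alive after the last frame."""
    active = {root: 0.0}  # only candidate lattice points are stored
    for x in frames:
        nxt = {}
        for node, score in active.items():
            # A path may advance to a child or stay in the same node.
            targets = list(children.get(node, []))
            if node != root:
                targets.append(node)
            for t in targets:
                cand = score + local_score(t, x)
                if cand > nxt.get(t, -math.inf):
                    nxt[t] = cand  # keep the best path into each node
        if not nxt:
            return {}
        # Beam pruning: drop points scoring below (best - beam_width).
        best = max(nxt.values())
        active = {n: s for n, s in nxt.items() if s >= best - beam_width}
    return active
```

Because only the surviving lattice points are kept in the
active dictionary, the full DP screen never has to be
allocated, which matches the remark that it is a virtual
one.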

[0072] In determination processing S03, the leaf node with
the maximum likelihood (the maximum likelihood leaf node)
is found, and it is determined whether the corresponding
word is uniquely determined. If it is (Y), namely when only
one word corresponds to the maximum likelihood leaf node,
that word is output as the recognition result. If it is not
(N), namely when a plurality of words correspond to the
maximum likelihood leaf node, the recognition result is
determined by the following method.

[0073] In re-collation candidate extraction processing S05,
re-collation candidates are selected. In the present
embodiment, the re-collation candidates are the words
corresponding to the maximum likelihood leaf node.
Alternatively, all the words corresponding not only to the
maximum likelihood leaf node but to the top K leaf nodes,
ranked by the cumulative scores remaining in the beam, may
be used as re-collation candidates.
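
The determination and candidate extraction steps can be
sketched in Python as below. The data layout (one dictionary
of leaf scores and one of leaf words) and the function names
are assumptions made for this illustration.

```python
def maximum_likelihood_word(leaf_scores, leaf_words):
    """Return the word if the best leaf holds exactly one word, else None."""
    best_leaf = max(leaf_scores, key=leaf_scores.get)
    words = leaf_words[best_leaf]
    return words[0] if len(words) == 1 else None


def recollation_candidates(leaf_scores, leaf_words, top_k=1):
    """Collect the words of the top_k leaves ranked by cumulative score."""
    ranked = sorted(leaf_scores, key=leaf_scores.get, reverse=True)[:top_k]
    candidates = []
    for leaf in ranked:
        for word in leaf_words[leaf]:
            if word not in candidates:
                candidates.append(word)
    return candidates
```

With top_k=1 the candidates are the words of the maximum
likelihood leaf node only; a larger top_k corresponds to the
alternative method based on the top K leaf nodes.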

[0074] Next, in re-collation tree development processing
S06, the first to nth steps of a speech segment tree in
which phoneme merging is not performed are developed for
the re-collation candidate words. This speech segment tree
serves as the first half speech segment re-collation tree
13. In the first half speech segment re-collation tree 13,
the word is uniquely determined within the first to nth
steps, so each word is allocated to an end node of the nth
step. Fig. 8 shows an example of the first half speech
segment re-collation tree 13 for the case where n=3 and the
re-collation candidates are the three words "meguro",
"memuro" and "nemuro".
[0075] In the present embodiment, when the collation
processing S02 is performed by DP matching beforehand, the
input frame position Fs corresponding to the start node of
the first step and the input frame position Fe
corresponding to the end node of the nth step must be
stored.

[0076] In first half re-collation processing S04, the input
speech from frame Fs to frame Fe is re-collated by DP
matching with the speech segment standard patterns 15
connected in accordance with the first half speech segment
re-collation tree 13. In re-collation, the number of
candidate words is small, so beam search is not necessarily
required. As a result of the re-collation, the word
corresponding to the end node of the nth step of the
re-collation tree that has the highest cumulative score is
output as the recognition result.
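
A minimal Python sketch of the first half re-collation is
shown below, assuming that each candidate word's unmerged
first-half standard pattern is available as a simple list of
reference frames and that the local score function is
supplied by the caller. The names dp_score and
recollate_first_half are hypothetical, and no beam is used
because the candidate set is small.

```python
import math


def dp_score(frames, pattern, local_score):
    """Best cumulative score aligning frames to pattern, first to last state."""
    neg = -math.inf
    prev = [neg] * len(pattern)
    prev[0] = local_score(pattern[0], frames[0])
    for x in frames[1:]:
        cur = [neg] * len(pattern)
        for i, ref in enumerate(pattern):
            # Either stay in state i or advance from state i - 1.
            best_prev = prev[i] if i == 0 else max(prev[i], prev[i - 1])
            if best_prev > neg:
                cur[i] = best_prev + local_score(ref, x)
        prev = cur
    return prev[-1]


def recollate_first_half(frames, fs, fe, candidates, local_score):
    """Re-collate only frames fs..fe (inclusive) against each candidate."""
    section = frames[fs:fe + 1]
    scores = {word: dp_score(section, pattern, local_score)
              for word, pattern in candidates.items()}
    return max(scores, key=scores.get), scores
```

Here fs and fe play the role of the stored frame positions
Fs and Fe, so frames outside that section are never
rescored.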

[0077] When all the words corresponding to the top K leaf
nodes, ranked by the cumulative scores remaining in the
beam, serve as re-collation candidates, a sum S is found
for each re-collation candidate word. S is the sum of the
score of the first half of the vocalization, namely the
score Sa from input frame Fs to frame Fe found by the
re-collation, and the score of the second half of the
vocalization, namely the score Sb from input frame Fe+1 to
the end frame of the vocalization found by the first
collation. The word with the largest S is determined as the
recognition result.
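
The score combination can be illustrated with a short Python
sketch. The dictionaries first_half and second_half, which
hold Sa and Sb per candidate word, and the function name
pick_by_combined_score are assumptions for this example.

```python
def pick_by_combined_score(first_half, second_half):
    """Return the word with the largest S = Sa + Sb and all the sums."""
    totals = {word: first_half[word] + second_half[word]
              for word in first_half}
    return max(totals, key=totals.get), totals
```

Sa comes from the re-collation of frames Fs to Fe and Sb
from the first collation of the remaining frames, so a word
that did not lead after the first collation can still win
on the combined score.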

[0078] In the present embodiment, phoneme merging is
performed collectively in the first to nth steps, but it
may be performed in all the steps. Further, a portion of
the tree that is densely divided may be partially modified,
for example by further division. Instead of collectively
re-collating up to the end node of the nth step,
re-collation may be performed up to the node at which the
word is uniquely determined. When phoneme merging is
performed in all the steps, it is not necessary to use the
speech segment standard pattern 15, in which phonemes are
not merged.

[0079] In the present embodiment, if only one word
corresponds to the maximum likelihood leaf node,
re-collation is not performed. Even in this case, however,
all the words corresponding to the top K leaf nodes by
cumulative score, not only the maximum likelihood leaf
node, may serve as re-collation candidates.

[0080] As described above, according to the present
embodiment, using the phoneme merged speech segment tree,
in which consonants belonging to the same phoneme group are
merged for the speech segments in the first to nth steps,
greatly reduces the quantity of calculation in the first
collation, and the total quantity of calculation is greatly
reduced even when re-collation is performed.
[0081] Further, in this method, since similar phonemes are
recognized without distinction, the possibility that a
correct candidate fails to be selected is low. Thus, the
quantity of calculation can be reduced without
deteriorating recognition performance.
[0082] Furthermore, in the present embodiment, the input
frame position Fs corresponding to the start node of the
first step and the input frame position Fe corresponding to
the end node of the nth step are stored, and re-collation
is performed only between Fs and Fe, so the quantity of
calculation required for re-collation is small.

[0083] (Embodiment 2)

Next, the operation of a speech recognition apparatus
according to Embodiment 2 of the present invention is
described with reference to the flowchart of Fig. 10.
[0084] What differs from Embodiment 1 is that the first
half speech segment re-collation tree 13 and the first half
re-collation processing S04 are replaced by a speech
segment re-collation tree 14 and re-collation processing
S21, respectively. Unlike in Embodiment 1, the re-collation
speech segment tree 14 represents not only the first to nth
steps but all steps up to the end step of a word.

[0085] Since operation of Embodiment 2 is almost the 
same as that of Embodiment 1, only different parts will be 
described. 

[0086] In Embodiment 1, re-collation is performed only on
the input section corresponding to the first to nth steps,
where phoneme merging was applied in the phoneme merged
speech segment tree used for the first collation. In
Embodiment 2, on the other hand, re-collation is performed
on the entire vocalization section.

[0087] In re-collation tree development processing S09, a
speech segment tree in which phoneme merging is not
performed is developed for the re-collation candidate
words. This speech segment tree is referred to as the
re-collation speech segment tree 14. The re-collation
speech segment tree 14 represents not only the first to nth
steps but all steps up to the end step of each word.
[0088] An example of the re-collation speech segment tree,
where the re-collation candidates are the three words
"meguro", "memuro" and "nemuro", is shown in Fig. 11.

[0089] In the present embodiment, the collation processing
S02 does not need to store the input frame position
corresponding to the start node of the first step or the
input frame position corresponding to the end node of the
nth step.

[0090] In re-collation processing S21, the input speech
from the start to the end of the vocalization is
re-collated by DP matching with the speech segment standard
patterns 15 connected in accordance with the re-collation
speech segment tree 14. As in Embodiment 1, the number of
candidate words in re-collation is small, so beam search is
not necessarily required.
[0091] As a result of the re-collation processing S21, the
word corresponding to the leaf node of the re-collation
tree with the highest cumulative score is output as the
recognition result.

[0092] As described above, according to Embodiment 2, even
when the input frame position corresponding to the end node
of the nth step was not optimal in the first collation,
re-collating from the start to the end of the vocalization
section allows more accurate collation. Thus, recognition
performance is further improved compared with Embodiment 1.

[0093] In Embodiment 2, since the input frame position Fs
corresponding to the start node of the first step and the
input frame position Fe corresponding to the end node of
the nth step need not be stored, the processing in the
first collation and the required memory capacity are
smaller than those of Embodiment 1.

[0094] In the case where the vocalization section is
re-collated from its start to its end, as in Embodiment 2, a
distance scale completely different from the one used for
the first collation may be used. Therefore, in performing
re-collation, recognition performance can be raised further
by using a method that achieves higher recognition accuracy
when the vocabulary is limited to several words.
[0095] (Embodiment 3)

Next, operation of a speech recognition apparatus according
to Embodiment 3 of the present invention will be described
with reference to the flowchart of Fig. 12.
[0096] Embodiment 3 differs from Embodiment 1 in that the
phoneme merged speech segment tree 11, the phoneme merged
speech segment standard pattern 16 and the speech segment
standard pattern 15 are replaced by a rough speech segment
tree 12, a rough speech segment standard pattern 17 and an
accurate speech segment standard pattern 18, respectively,
and in that the determination processing S03 is not required.

[0097] The accurate speech segment standard pattern 18
is the same as the speech segment standard pattern 15 of
Embodiment 1. In Embodiment 3, a normal speech segment is
referred to as an accurate speech segment in order to
contrast it with a rough speech segment.

[0098] The rough speech segment tree 12 and the rough
speech segment standard pattern 17 will hereinafter be
described. A rough speech segment is defined as a speech
segment whose standard pattern has reduced accuracy. The
following two methods are conceivable for obtaining it.

[0099] The first is a method in which the quantity of
distance calculation required per rough speech segment is
reduced compared with that required per accurate speech
segment. Specifically, conceivable approaches include
reducing the number of frames of the rough speech segment
pattern, reducing the number of mixture components of the
Gaussian distribution, and sharing covariance matrices among
the Gaussian distributions to reduce the number of kinds of
covariance matrices. In this method, the shape of the speech
segment tree is not changed.
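
The effect of reducing the number of frames can be
illustrated with the following minimal sketch. It rests on
assumptions that are not in the source: the function names
halve_frames and local_distance_count are hypothetical, and
averaging adjacent frames is used here only as a stand-in,
whereas the embodiment obtains the rough pattern by learning.
The sketch shows that the number of local distance
calculations in a full DP grid falls in proportion to the
number of pattern frames.

```python
def halve_frames(pattern):
    """Average each adjacent pair of frames; a trailing odd frame is kept.

    Averaging is an assumption of this sketch, not the source's method.
    """
    reduced = []
    for k in range(0, len(pattern), 2):
        pair = pattern[k:k + 2]
        reduced.append(tuple(sum(vals) / len(pair) for vals in zip(*pair)))
    return reduced

def local_distance_count(n_input_frames, n_pattern_frames):
    """Local distance calculations needed to fill a full DP grid."""
    return n_input_frames * n_pattern_frames

# Hypothetical 4-frame accurate pattern with 2-dimensional features.
accurate = [(0.0, 2.0), (2.0, 4.0), (4.0, 6.0), (6.0, 8.0)]
rough = halve_frames(accurate)

cost_accurate = local_distance_count(10, len(accurate))
cost_rough = local_distance_count(10, len(rough))
```

With ten input frames, the accurate pattern needs 40 local
distance calculations and the rough pattern 20, while the
sequence of speech segments, and hence the shape of the
tree, is untouched.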

[0100] The second is a method for merging speech segments
of different phonological environments within a range in
which the recognition result is still uniquely determined.
With this method also, since the arcs and nodes of the tree
are reduced, the quantity of calculation can be reduced. For
example, VCs whose vowel portions are the same may be merged
into one speech segment even if their subsequent consonants
are different. In this method, the shape of the speech
segment tree may be changed. In the case where phonemes are
used as units of the speech segment, speech segments are in
many cases treated as different depending on the phonemic
environment around them; if those having the same center
phoneme are merged into one speech segment, the expansion of
the tree can be greatly suppressed. As a matter of course,
if they have the same center phoneme, the recognition result
is still uniquely determined even if merging is performed.
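
The VC merging described above can be sketched as follows.
This is a minimal sketch under stated assumptions: the
decomposition of "meguro", "memuro" and "nemuro" into
alternating CV and VC segments, the nested-dictionary tree
and the names is_vc, to_rough, build_tree and count_nodes are
all hypothetical and are not taken from the figures. The
sketch merges VCs that share a vowel within the first n
steps, counts the nodes of the resulting tree, and checks
that each word is still uniquely identified.

```python
VOWELS = set("aiueo")

def is_vc(segment):
    """True for a vowel-consonant segment such as 'eg' or 'em'."""
    return len(segment) == 2 and segment[0] in VOWELS and segment[1] not in VOWELS

def to_rough(segments, n):
    """Replace the consonant of each VC in the first n steps with '*'."""
    return [seg[0] + "*" if i < n and is_vc(seg) else seg
            for i, seg in enumerate(segments)]

def build_tree(sequences):
    """Expand segment sequences into a tree of nested dictionaries."""
    root = {}
    for seq in sequences:
        node = root
        for seg in seq:
            node = node.setdefault(seg, {})
    return root

def count_nodes(tree):
    """Number of nodes below the root (one per arc)."""
    return sum(1 + count_nodes(child) for child in tree.values())

# Hypothetical segmentations, assumed only for this illustration.
words = {
    "meguro": ["me", "eg", "gu", "ur", "ro"],
    "memuro": ["me", "em", "mu", "ur", "ro"],
    "nemuro": ["ne", "em", "mu", "ur", "ro"],
}
rough_words = {w: to_rough(segs, n=3) for w, segs in words.items()}

base_nodes = count_nodes(build_tree(words.values()))
rough_nodes = count_nodes(build_tree(rough_words.values()))
still_unique = len({tuple(s) for s in rough_words.values()}) == len(words)
```

In this toy vocabulary /eg/ and /em/ after /me/ collapse into
a single /e*/ node, so the tree shrinks by one node, and the
following CV segment (/gu/ or /mu/) still separates the words.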

[0101] Embodiment 3 uses both the method for reducing the
number of frames of the standard pattern and the method in
which VCs having the same vowel portion and different
subsequent consonants are merged. The former is represented
by affixing a bar to a speech segment symbol, while the
latter is represented by replacing the consonant portion
with an asterisk.

[0102] Fig. 13 shows a rough speech segment tree obtained
by converting the speech segments from the first to third
steps (n=3) of the base speech segment tree of Fig. 3 to
rough speech segments. The speech segments from the fourth
step onward are the same as those of the base speech segment
tree. The shape of the rough speech segment tree is slightly
different from Fig. 2. VC-merging may also be limited to the
case where the subsequent consonants belong to the same
phoneme group.

[0103] The rough speech segment standard pattern 17 is
provided in advance by learning. As to the number of frames,
learning is performed with the number of frames of the
standard pattern reduced to half the original number.
Moreover, as to VCs, learning is performed using all the
learning data of the speech segments having the same vowel
portion. For example, the standard pattern for the speech
segment /e*/ can be obtained by learning from all the
learning data of the speech segments whose vowel portion is
/e/ and whose subsequent consonants differ, i.e., /em/,
/en/, /eg/, /eb/.
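
The pooling of learning data for a merged segment such as
/e*/ can be sketched as follows. This is a minimal sketch
under stated assumptions: the names pool_by_vowel and
mean_pattern, the toy data, and the use of a frame-wise mean
as the "learning" step are hypothetical, since the source
does not specify the learning procedure. Every sample is
also assumed to have already been normalized to the same
number of frames.

```python
from collections import defaultdict

def pool_by_vowel(training_data):
    """Group the samples of VC segments under the merged label 'V*'."""
    pooled = defaultdict(list)
    for label, samples in training_data.items():
        pooled[label[0] + "*"].extend(samples)
    return dict(pooled)

def mean_pattern(samples):
    """Frame-wise mean over samples with an equal number of frames.

    The mean is a stand-in for the unspecified learning procedure.
    """
    n = len(samples)
    return [tuple(sum(vals) / n for vals in zip(*frames))
            for frames in zip(*samples)]

# Hypothetical learning data: one 2-frame, 1-dimensional sample per VC.
training_data = {
    "em": [[(1.0,), (2.0,)]],
    "en": [[(3.0,), (4.0,)]],
    "eg": [[(5.0,), (6.0,)]],
    "eb": [[(7.0,), (8.0,)]],
}
pooled = pool_by_vowel(training_data)
rough_e = mean_pattern(pooled["e*"])
```

The single pattern for /e*/ is thus estimated from the data
of /em/, /en/, /eg/ and /eb/ together rather than from each
VC separately.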
[0104] Operation of Embodiment 3 is almost the same as
that of Embodiment 1, and thus only the differing portions
will be described.

[0105] In collation processing S02, in the same manner as
in Embodiment 1, the feature parameter time series of an
unknown input speech is collated with that of the standard
patterns while the rough speech segment standard pattern 17
and the accurate speech segment standard pattern 18 are
connected in accordance with the rough speech segment tree
12.

[0106] After the collation, re-collation candidates are
extracted in re-collation candidate extraction processing
S05. In the present embodiment, the K vocabularies
corresponding to the K leaf nodes with the highest
cumulative scores serve as the re-collation candidates. In
the same manner as in Embodiment 1, the first half speech
segment re-collation tree 13 is developed for each
re-collation candidate, and the first half of the
vocalization is collated with the accurate speech segment
standard pattern 18 in the first half re-collation
processing S04.
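
The extraction of the top K candidates can be sketched as
follows. This is a minimal sketch under stated assumptions:
the function name top_k_candidates, the score values and the
fourth word "mejiro" are hypothetical and serve only to show
a leaf node being excluded.

```python
import heapq

def top_k_candidates(leaf_scores, k):
    """Return the k words whose leaf nodes have the highest cumulative scores."""
    return heapq.nlargest(k, leaf_scores, key=leaf_scores.get)

# Hypothetical cumulative scores at the leaf nodes after the first collation.
leaf_scores = {
    "meguro": -12.0,
    "memuro": -10.5,
    "nemuro": -11.0,
    "mejiro": -20.0,  # hypothetical extra word, not from the source
}
recollation_candidates = top_k_candidates(leaf_scores, k=3)
```

Only the K selected words are then expanded into the first
half re-collation tree, so the cost of the accurate
re-collation depends on K rather than on the size of the
whole vocabulary.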
[0107] For every candidate vocabulary, the sum S of the
score Sa of the first half of the vocalization, which is
found as a result of the re-collation, and the score Sb of
the second half of the vocalization, which is found as a
result of the first collation, is calculated. Then, the
vocabulary with the largest S is determined as the
recognition result.
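
The final decision of paragraph [0107] amounts to computing
S = Sa + Sb for each candidate and taking the maximum. The
following minimal sketch assumes hypothetical score values
and a hypothetical function name final_decision.

```python
def final_decision(first_half_scores, second_half_scores):
    """Sum Sa and Sb per candidate and return the word with the largest S."""
    totals = {word: first_half_scores[word] + second_half_scores[word]
              for word in first_half_scores}
    return max(totals, key=totals.get), totals

# Hypothetical scores: Sa from the first half re-collation,
# Sb carried over for the second half of the vocalization.
sa = {"meguro": -4.0, "memuro": -6.5, "nemuro": -7.0}
sb = {"meguro": -5.0, "memuro": -3.0, "nemuro": -3.0}
result, totals = final_decision(sa, sb)
```

In this toy case "meguro" has the weakest second half score
but is selected because its re-collated first half score
gives it the largest sum.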

[0108] In the present embodiment, although phoneme merging
is performed collectively from the first to nth steps, a
portion of the tree that is densely divided may be partially
modified, for example by dividing it further. Instead of
collectively performing re-collation up to the end node of
the nth step, re-collation may be performed up to the node
at which the word is uniquely determined.
[0109] As described above, according to Embodiment 3, by
using the rough speech segment tree, in which the accuracy
of the speech segment standard pattern is reduced, the
calculation quantity required for the collation of the rough
speech segments is small, and thus the calculation quantity
in the first collation can be greatly reduced. Even when
re-collation is performed, the total calculation quantity
can be reduced.

[0110] Since rough collation is performed in the part
where the calculation quantity is large, while accurate
collation is performed in the part, a little while after the
start of vocalization, where the calculation quantity is
small, the efficiency is high.



